Internal Pattern Matching Queries in a Text and Applications
Authors
Abstract
We consider several types of internal queries, that is, questions about subwords of a text. As the main tool, we develop an optimal data structure for a problem that we call internal pattern matching. This data structure provides constant-time answers to queries about the occurrences of one subword x in another subword y of a given text, assuming that |y| = O(|x|), which allows for a constant-space representation of all occurrences. This problem can be viewed as a natural extension of the well-studied pattern matching problem. The data structure has linear size and admits a linear-time construction algorithm. Using the solution to the internal pattern matching problem, we obtain very efficient data structures answering queries about primitivity of subwords, periods of subwords, general substring compression, and cyclic equivalence of two subwords. All these results improve upon the best previously known counterparts. The linear construction time of our data structure also allows us to improve the algorithm for finding δ-subrepetitions in a text (a more general version of maximal repetitions, also called runs). For any fixed δ, we obtain the first linear-time algorithm, which matches the linear time complexity of the algorithm computing runs. Our data structure has already been used as part of efficient solutions for subword suffix rank and selection, as well as substring compression using the Burrows-Wheeler transform composed with run-length encoding. The model of internal queries in texts is connected to the well-studied problem of text indexing. Both models have their origins in the introduction of suffix trees. However, there is an important difference: in our model the size of the representation of a query is constant, which enables faster query times.
Our results can be viewed as efficient solutions to “internal” equivalents of several basic problems of regular pattern matching, and they improve on the majority of previously published results related to internal queries.
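To make the query format concrete, the following is a minimal sketch of what an internal pattern matching query returns. It is not the data structure from the paper: it uses a naive scan that takes time proportional to |x|·|y| rather than constant time, and the function names `occurrences` and `as_progression` are hypothetical, chosen only for illustration. It relies on the standard fact that when |y| ≤ 2|x|, the starting positions of x in y form a single arithmetic progression, which is why a constant-space answer is possible.

```python
from typing import List, Optional, Tuple


def occurrences(text: str, x: Tuple[int, int], y: Tuple[int, int]) -> List[int]:
    """Naive scan (NOT the paper's structure): start positions in `text`
    where the subword text[x0:x1] occurs inside the subword text[y0:y1].
    Both subwords are given as half-open index pairs."""
    x0, x1 = x
    y0, y1 = y
    pattern = text[x0:x1]
    m = len(pattern)
    if m == 0:
        raise ValueError("the pattern subword must be non-empty")
    return [p for p in range(y0, y1 - m + 1) if text[p:p + m] == pattern]


def as_progression(positions: List[int]) -> Optional[Tuple[int, int, int]]:
    """Compress sorted positions into (first, step, count).
    Returns None for an empty list and raises ValueError if the
    positions do not form a single arithmetic progression."""
    if not positions:
        return None
    if len(positions) == 1:
        return (positions[0], 0, 1)
    step = positions[1] - positions[0]
    if any(b - a != step for a, b in zip(positions, positions[1:])):
        raise ValueError("positions do not form one arithmetic progression")
    return (positions[0], step, len(positions))


if __name__ == "__main__":
    text = "ababababc"
    # x = "abab" (text[0:4]), y = "abababab" (text[0:8]); here |y| <= 2|x|.
    occ = occurrences(text, (0, 4), (0, 8))
    print(occ, as_progression(occ))
```

In this sketch a query is just two pairs of indices into the text, so the query itself has constant size, and the answer is a triple (first position, step, count) regardless of how many occurrences there are.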
Similar resources
Practical Authenticated Pattern Matching with Optimal Proof Size
We address the problem of authenticating pattern matching queries over textual data that is outsourced to an untrusted cloud server. By employing cryptographic accumulators in a novel optimal integrity-checking tool built directly over a suffix tree, we design the first authenticated data structure for verifiable answers to pattern matching queries featuring fast generation of constant-size proo...
Optimal Data Structure for Internal Pattern Matching Queries in a Text and Applications
We present a linear-space data structure which enables very fast (usually constant time) answers to several types of internal queries — questions about factors (also called substrings) of a text. A factor-in-factor occurrence query asks for a representation of the set of all occurrences of one factor x in another factor y of the same text v of length n. It assumes that |y| = O(|x|), in this cas...
Engineering a Distributed Full-Text Index
We present a distributed full-text index for big data applications in a distributed environment. The index can be used to answer different types of pattern matching queries (existential, counting and enumeration) and also be extended to answer document retrieval queries (counting, retrieve and top-k). We also show that succinct data structures are indeed useful for big data applications, as the...
Ordered Pattern Matching: towards Full-text Retrieval
A typical query in Information Retrieval consists of multiple keywords, and the target is to retrieve matching documents. Various selection heuristics and ranking criteria are based on the proximity of keywords. For this purpose, it helps if the positions of the keywords in the document are listed in sorted order. For traditional documents consisting of lists of words, existing data structures ...
Parameterized Pattern Matching - Succinctly
The fields of succinct data structures and compressed text indexing have seen quite a bit of progress over the last 15 years. An important achievement, primarily using techniques based on the Burrows-Wheeler Transform (BWT), was obtaining the full functionality of suffix tree in the optimal number of bits. A crucial property that allows the use of BWT for designing compressed indexes is order-p...
Publication date: 2015